dataset: [ADD] 50 Vietnamese dataset from vn-mteb by BaoLocPham · Pull Request #2964 · embeddings-benchmark/mteb

BaoLocPham · 2025-08-01T14:39:20Z

I have outlined why this dataset is filling an existing gap in mteb
I have tested that the dataset runs with the mteb package.

import mteb
# sample model:
model = mteb.get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

task = mteb.get_task("{name of your task}")
evaluation = mteb.MTEB(tasks=[task])
evaluation.run(model)

I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)

Task Name	Task Type
ArguaAna-VN	Retrieval
SciFact-VN	Retrieval
ClimateFEVER-VN	Retrieval
FEVER-VN	Retrieval
DBPedia-VN	Retrieval
NQ-VN	Retrieval
HotpotQA-VN	Retrieval
MSMARCO-VN	Retrieval
TRECCOVID-VN	Retrieval
FiQA2018-VN	Retrieval
NFCorpus-VN	Retrieval
SCIDOCS-VN	Retrieval
Touche2020-VN	Retrieval
Quora-VN	Retrieval
CQADupstackAndroid-VN	Retrieval
CQADupstackGis-VN	Retrieval
CQADupstackMathematica-VN	Retrieval
CQADupstackPhysics-VN	Retrieval
CQADupstackProgrammers-VN	Retrieval
CQADupstackStats-VN	Retrieval
CQADupstackTex-VN	Retrieval
CQADupstackUnix-VN	Retrieval
CQADupstackWebmasters-VN	Retrieval
CQADupstackWordpress-VN	Retrieval
Total Retrieval Tasks: 24	--
--	--
Banking77VNClassification	Classification
EmotionVNClassification	Classification
AmazonCounterfactualVNClassification	Classification
MTOPDomainVNClassification	Classification
TweetSentimentExtractionVNClassification	Classification
ToxicConversationsVNClassification	Classification
ImdbVNClassification	Classification
MTOPIntentVNClassification	Classification
MassiveScenarioVNClassification	Classification
MassiveIntentVNClassification	Classification
AmazonReviewsVNClassification	Classification
AmazonPolarityVNClassification	Classification
Total Classification Tasks: 12	--
--	--
SprintDuplicateQuestions-VN	Pair Classification
TwitterSemEval2015-VN	Pair Classification
TwitterURLCorpus-VN	Pair Classification
Total Pair Classification Tasks: 3	--
--	--
TwentyNewsgroupsClustering-VN	Clustering
RedditClusteringP2P-VN	Clustering
StackExchangeClusteringP2P-VN	Clustering
StackExchangeClustering-VN	Clustering
RedditClustering-VN	Clustering
Total Clustering Tasks: 5	--
--	--
SciDocsRR-VN	Reranking
AskUbuntuDupQuestions-VN	Reranking
StackOverflowDupQuestions-VN	Reranking
--	--
Total Reranking Tasks: 3	--
BIOSSES-VN	STS
SICK-R-VN	STS
STSBenchmark-VN	STS
Total STS Tasks: 3	--
--	--

All of the dataset is taken from our work from this paper, this is the preprint citation of our dataset.

@misc{pham2025vnmtebvietnamesemassivetext,
      title={VN-MTEB: Vietnamese Massive Text Embedding Benchmark}, 
      author={Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
      year={2025},
      eprint={2507.21500},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.21500}, 
}

I'll update the models results after this PR to create a new benchmark for VN-MTEB

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

Samoed · 2025-08-02T12:58:20Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+
+from mteb.abstasks.TaskMetadata import TaskMetadata
+
+from ....abstasks import AbsTaskClassification, MultilingualTask


I think your tests are failing, because you need to import from

Suggested change

from ....abstasks import AbsTaskClassification, MultilingualTask

from mteb.abstasks.AbsTaskClassification import AbsTaskClassification

from mteb.abstasks.TaskMetadata import TaskMetadata

KennethEnevoldsen

Hei happy to see the PR and congratulations on the release.

I know that the paper is already out, but I was a bit sad to see that you only use machine-translated datasets (though the verification pipeline does help a lot).

if you want to make a v2 of the benchmark then it might be ideal to use some of the native datasets in mteb, you can see that there is at 26 available:

import mteb
tasks = mteb.get_tasks(languages=["vie"])

tasks= [t for t in tasks if t.metadata.sample_creation != "machine-translated"]

len(tasks) # 26

Can I also ask you to compute the metrics using:

mteb.get_task(...)
task.calculate_metadata_metrics()

KennethEnevoldsen · 2025-08-02T17:05:59Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+        dialect=[],
+        sample_creation="machine-translated",
+        socioeconomic_status=None,
+        text_creation=None,


Test fails as many of metadata fields are not specified. Do ask if there is question on how to fill them out

KennethEnevoldsen · 2025-08-02T17:06:30Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+        annotations_creators="derived",
+        dialect=[],
+        sample_creation="machine-translated",
+        socioeconomic_status=None,


Suggested change

socioeconomic_status=None,

No longer used

KennethEnevoldsen · 2025-08-02T17:09:34Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+class AmazonCounterfactualVNClassification(AbsTaskClassification):
+    metadata = TaskMetadata(


Suggested change

class AmazonCounterfactualVNClassification(AbsTaskClassification):

metadata = TaskMetadata(

class AmazonCounterfactualVNClassification(AbsTaskClassification):

num_samples = 32

n_experiments = 10

metadata = TaskMetadata(

I thought abot this too, but n_experiments can't be passed like this

mteb/mteb/abstasks/AbsTaskClassification.py

Line 96 in e4f30e9

else self.metadata_dict.get("n_experiments", 10)

But 10 is default value, so it can be removed

KennethEnevoldsen · 2025-08-02T17:11:06Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+    )
+
+    @property
+    def metadata_dict(self) -> dict[str, str]:


can be deleted (see comment above)

KennethEnevoldsen · 2025-08-02T17:11:53Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+            "revision": "b48bc27d383cfca5b6a47135a52390fa5f66b253"
+        },
+        description=(
+            "A collection of Amazon customer reviews annotated for counterfactual detection pair classification."


Please also add a description of how it was machine translated, and that it was adapted from AmazonCounterfactualClassification.

KennethEnevoldsen · 2025-08-02T17:12:05Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+        eval_langs=["vie-Latn"],
+        main_score="accuracy",
+        date=("2025-07-29", "2025-07-30"),
+        form=None,


Suggested change

form=None,

KennethEnevoldsen · 2025-08-02T17:19:06Z

mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py

+        license="cc-by-sa-4.0",
+        annotations_creators="derived",
+        dialect=[],
+        sample_creation="machine-translated",


I would make this "machine-translated and LM verified," given the pipeline. I would also describe the verification process in the description.

BaoLocPham · 2025-08-04T14:26:47Z

hi @Samoed, can you have a quick check why on the tests/test_evaluators/test_RerankingEvaluator.py give an assertion error? Idk why it's happen on my side.

Samoed · 2025-08-04T14:37:44Z

Currenly in logs there is no error in the tests/test_evaluators/test_RerankingEvaluator.py. The only error, that you need to update this test, because you've added a new task

FAILED tests/test_models/test_model_meta.py::test_model_similar_tasks[training_datasets0] - AssertionError: assert ['NanoTouche2...Retrieval.v3'] == ['NanoTouche2...Retrieval.v3']
  
  At index 4 diff: 'Touche2020-VN' != 'Touche2020Retrieval.v3'
  Left contains one more item: 'Touche2020Retrieval.v3'
  
  Full diff:
    [
        'NanoTouche2020Retrieval',
        'Touche2020',
        'Touche2020-Fa',
        'Touche2020-NL',
  +     'Touche2020-VN',
        'Touche2020Retrieval.v3',
    ]

BaoLocPham · 2025-08-04T16:16:18Z

Hi @KennethEnevoldsen , @Samoed, thanks for your constructive commend, I added new SAMPLE_CREATION_METHOD -> "machine-translated and LM verified" in mteb/abstasks/TaskMetadata.py. All of 50 new dataset follow this sample_creation. Thank you.

KennethEnevoldsen

There are still comments that haven't yet been resolved. Please take another look at these

BaoLocPham · 2025-08-08T04:17:22Z

There are still comments that haven't yet been resolved. Please take another look at these

I already updated the code base on the comments you gave. Please have a look

KennethEnevoldsen · 2025-08-09T14:29:08Z

Thanks! Great to have these merged

* model: add image support for jina embeddings v4 (#2893) * feat: unify text and image embeddings for all tasks * fix: uniform batch size * fix: update error message * fix: update code task * fix: update max length * fix: apply review suggestions * model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889) * feat: add KaLM_Embedding_X_0605 in kalm_models * Update kalm_models.py for lint format * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 --------- Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> * Add Classification Evaluator unit test (#2838) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: update colpali engine models (#2905) * adding vidore benchmarks * fix typo * clean vidore names + per lang eval * lint * vidore names * bibtex fix * fix revision * vidore v2 citation * update citation format and fix per-language mappings * lint: citations * typo citations * fix revisiions * lint * fix colnomic3b revision * fix colqwen2.5 revision + latest repo version * fix query agmentation tokens * colsmol revision * 1.38.35 Automatically generated by python-semantic-release * Evaluator tests (#2910) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Adding STSEvaluator and SummarizationEvaluator tests * Correcting due to the comments * Correcting due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Classification dataset cleaning (#2900) * Classification dataset cleaning * Update pull request number * Fix metadata test * fix formatting * add script for cleaning * Update tasks & benchmarks tables * dataset: Add JapaneseSentimentClassification (#2913) Add JapaneseSentimentClassification * Update tasks & benchmarks tables * fix: change `passage` prompt to `document` (#2912) * change document to passage * fix prompt names * fix kwargs check * fix default prompt * 1.38.36 Automatically generated by python-semantic-release * model: Add OpenSearch inf-free sparse encoding models (#2903) add opensearch inf-free models Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * dataset: add BarExamQA dataset (#2916) * Add BareExamQA retrieval task * ran linter * updated details * updated details * fixed subtype name * fixed changes * ran linter again * Use `mteb.get_model` in adding_a_dataset.md (#2922) Update adding_a_dataset.md * fix: specify revision for opensearch (#2919) specify revision for opensearch * 1.38.37 Automatically generated by python-semantic-release * Update the link for gemini-embedding-001 (#2928) * fix: replace with passage (#2934) * fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940) * fix: Only import SparseEncoder once sentence-transformer version have been checked fixes #2936 * Update mteb/models/opensearch_neural_sparse_models.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939) The leaderboard would have (silent) errors where `get_benchmark` lead to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue. * docs: Update adding_a_dataset.md (#2947) * docs: Update adding_a_dataset.md * Update docs/adding_a_dataset.md * ci: bump semantic release * 1.38.38 Automatically generated by python-semantic-release * dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935) * BSARD loader fixed * BSARDv2 metadata fixed * Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tasks & benchmarks tables * dataset: add GovReport dataset (#2953) * Added govreport task * Updated description * dataset: add BillSum datasets (#2943) * Added BillSum datasets * fixed billsumca * Updated BillSumCA description * Updated BillSumUS description * Update mteb/tasks/Retrieval/eng/BillSumCA.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/BillSumUS.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * lint * lint --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716) * Add RuSciBench * fix bitext mining lang * Add regression task * fix init * add missing files * Improve description * Add superseded_by * fix lint * Update regression task to match with v2 * Add stratified_subsampling for regression task * Add boostrap for regression task * Rename task class, add model as evaluator argument * fix import * fix import 2 * fixes * fix * Rename regression model protocol * Update tasks & benchmarks tables * 1.38.39 Automatically generated by python-semantic-release * qzhou-embedding model_meta & implementation (#2975) * qzhou-embedding model_meta & implementation * Update qzhou_models.py * Update qzhou_models.py Processing todo items（Add default instruction） * Update qzhou_models.py correct bge datalist * Update qzhou_models.py correct 'public_training_data' * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update mteb/models/qzhou_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/qzhou_models.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * format qzhou_models.py for ruff check --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * model: Add Voyage 3.5 model configuration (#3005) Add Voyage 3.5 model configuration - Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens - Set release date to 2025-01-21 with revision 1 - Configure for cosine similarity with instruction support - Include standard Voyage training datasets reference 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Claude <noreply@anthropic.com> * model: BAAI/bge-m3-unsupervised Model (#3007) * Add BAAI/bge-m3-unsupervised Model (BAAI/bge_m3_retromae is commented out - the details are proper, but it fails during loading the model for me, so i commented out) * Remove the commented retromae model --------- Co-authored-by: fzowl <zoltan@voyageai.com> * lint: Correcting lint errors (#3004) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Correcting the lint errors --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * dataset: Added 50 Vietnamese dataset from vn-mteb (#2964) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * model: Add Cohere embed-v4.0 model support (#3006) * Add Cohere embed-v4.0 model support - Add text-only embed-v4.0 model in cohere_models.py - Add multimodal embed-v4.0 model in cohere_v.py - Support configurable dimensions (256, 512, 1024, 1536) - Support 128,000 token context length - Support multimodal embedding (text, images, mixed PDFs) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * Add Cohere embed-v4.0 model support Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * Add OpenAI models with 512 dimension (#3008) * Add OpenAI/text-embedding-3-small (512 dim) Add OpenAI/text-embedding-3-large (512 dim) * Correcting due to comments --------- Co-authored-by: fzowl <zoltan@voyageai.com> * Standardise task names and fix citation formatting (#3026) fixes for name formatting * Update tasks & benchmarks tables * fix: Add missing training sets for qzhou (#3023) * Supplement missing training sets * reformat code * Reorganize the data list format * update qzhou_model meta * 1.38.40 Automatically generated by python-semantic-release * model: Add samilpwc_models meta (#3028) * model: Add samilpwc_models meta * Fix: Remove CONST * Fix: Reformat File * Update: model revision * model: Add granite-vision-embedding model (#3029) * Add files via upload * Address review comments * Address review comments * ruff format * Update mteb/models/granite_vision_embedding_models.py * lint error fix --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: incorrect revision for SNLRetrieval (#3033) The provided revisions doesn't seem to be present on: adrlau/navjordj-SNL_summarization_copy Replacing with latest revision * dataset: Add HumanEvalRetrieval task (#3022) * Add HumanEvalRetrieval dataset * Fix TaskMetadata structure and remove descriptive_stats - Use TaskMetadata class instead of dict - Remove descriptive_stats as requested in PR review - Add date field and proper import structure * Fix dataset path and use verified metadata - Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval - Use actual description from HuggingFace dataset page - Remove fabricated citation and reference - Remove revision field that was incorrect - Reference HuggingFace dataset page instead of arxiv * Add correct revision hash to HumanEval - Add revision hash: ed1f48a for reproducibility * Fix HumanEval metadata validation - Add date field for metadata completeness - Add bibtex_citation field (empty string) - Required for TaskMetadata validation to pass - Should resolve PR test failure * Address reviewer feedback - Remove trust_remote_code parameter as requested - Add revision parameter to load_dataset() calls for consistency - Use metadata revision hash in dataset loading for reproducibility * Fix field names in HumanEval dataset loading Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format. * Fix deprecated metadata_dict usage Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility. * Fix data structure for MTEB compatibility - Organize data by splits as expected by MTEB retrieval tasks - Convert scores to integers for pytrec_eval compatibility * Address PR feedback for HumanEval dataset - Add descriptive statistics using calculate_metadata_metrics() - Enhance metadata description with dataset structure details - Add complete BibTeX citation for original paper - Update to full commit hash revision - Add python-Code language tag for programming language - Explain retrieval task formulation clearly * Fix BibTeX citation formatting for HumanEvalRetrieval - Update citation to match bibtexparser formatting requirements - Fields now in alphabetical order with lowercase names - Proper trailing commas and indentation * Update tasks & benchmarks tables * 1.38.41 Automatically generated by python-semantic-release * ci: reduce parallel runs for when checking if a dataset exists (#3035) The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831) * ci: Updating rerun delays to prevent false positives errors * ci: Updating rerun delays to prevent false positives errors * model: Add GreenNode Vietnamese Embedding models (#2994) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * model: add granite-embedding-english R2 models (#3050) * fix: Updated revision for jina-embeddings-v4 (#3046) * fix: jinav4 revision Signed-off-by: admin <bo.wang@jina.ai> * change revision instead of removing it Signed-off-by: admin <bo.wang@jina.ai> --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: admin <bo.wang@jina.ai> * 1.38.42 Automatically generated by python-semantic-release * Fix 3 VN-MTEB Pair Classification tasks (#3053) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * [FIX] VN-MTEB 3 datasets PairClassification rename column * dataset: Add mbpp retrieval (#3037) * Add MBPP retrieval task - Code retrieval task based on 378 Python programming problems - Natural language queries matched to Python code implementations - Uses python-Code evaluation language for code-specific metrics - Includes proper citations and descriptive statistics * Add MBPPRetrieval to imports * Add descriptive statistics for MBPPRetrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * dataset: Added wikisql retrieval (#3039) * Add WikiSQL retrieval task - Code retrieval task based on WikiSQL natural language to SQL dataset - Natural language questions matched to SQL query implementations - Uses sql-Code evaluation language for SQL-specific metrics - Includes proper citations and descriptive statistics * Add WikiSQLRetrieval to imports * Add descriptive statistics for WikiSQLRetrieval * Reformatting * Reformatting * Reformatting, correcting the revision * Update tasks & benchmarks tables * ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors try to fix CI * fix MBPPRetrieval revision (#3055) Update MBPPRetrieval.py Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> * fix: Add VN-MTEB benchmark and Leaderboard (#2995) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] VN-MTEB benchmark and leaderboard * [FIX] wrong benchmark name * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * 1.38.43 Automatically generated by python-semantic-release * Add hc3finance retrieval (#3041) * Add HC3Finance retrieval task - Financial retrieval task based on HC3 Finance dataset - Financial questions matched to human and AI-generated content - Covers financial explanations, analysis, and educational content - Includes proper citations and descriptive statistics * Add HC3FinanceRetrieval to imports * Add descriptive statistics for HC3FinanceRetrieval * Reformatting * Reformatting, correcting the revision * Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Add finqa retrieval (#3042) * Add FinQA retrieval task - Financial numerical reasoning retrieval task based on FinQA dataset - Numerical financial questions matched to relevant document data - Covers earnings reports with tables and quantitative financial data - Includes proper citations and descriptive statistics * Add FinQARetrieval to imports * Add descriptive statistics for FinQARetrieval * Reformatting * Reformatting * Update mteb/tasks/Retrieval/eng/FinQARetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FinanceBenchRetrieval task (#3044) * Add FinanceBenchRetrieval * Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FreshStackRetrieval task (#3043) * Add FreshStackRetrieval * Reformatting, correcting the revision * Dataset correction * Update tasks & benchmarks tables * dataset: Add ds1000 retrieval (#3038) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * Add ChatDoctorRetrieval (#3045) * Add ChatDoctorRetrieval * Reformatting, correcting the revision * Correct the dataset citation * Correcting due to comments * Update tasks & benchmarks tables * Correcting the (new) DS1000 dataset's revision (#3063) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Add DS1000Retrieval task implementation * dataset: Add JinaVDR (#2942) * feat: added jinavdr benchmark * feat: added description for jinavdr * feat: fixed licenses and added bibtex * feat: made jinav4 compatible with vidore benchmark * feat: corrected query numbers * feat: removed print * feat: added max pixel argument for jina models * feat: score calculation on cpu * feat: adjust jina model for new mteb code * feat: code cleanup * feat: corrected bibtex * feat: make colpali run with jinavdr * feat: fixed comments * feat: better reference and fixed comments * feat: added date for tasks * feat: fixed missing metadata and bibtex * feat: added descriptions per dataset * Update tasks & benchmarks tables * model: Add CoDi-Embedding-V1 (#3054) * add codiemb-minicpm * replace codiemb_minicpm with codi_model * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * update code * update code * reformat --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: ensure that there are always relevant docs attached to query (#3058) * fix: ensure that there are always relevant docs attached to query Here is brief test that it doesn't influence scores: ```py t1 = mteb.get_task("TwitterHjerneRetrieval") meta = mteb.get_model_meta("minishlab/potion-base-2M") eval = mteb.MTEB(tasks=[t1]) res = eval.run(model=meta.load_model()) # before fix: res[0].get_score() # np.float64(0.02837) res[0].scores before_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # with update: res[0].get_score() # np.float64(0.02837) res[0].scores with_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # check with_fix == before_fix # True * restructure * format * relax pytrec versions * fix incorrect parsing * 1.38.44 Automatically generated by python-semantic-release * Correcting the JINA models with SentenceTransformerWrapper (#3071) * ci: Add stale workflow (#3066) * add stale workflow * add permissions * add bug label to bug issue template * revert bug issue and only look at more info needed issues * more accurate name * override default * fix: open_clip package validation (#3073) * 1.38.45 Automatically generated by python-semantic-release * fix: Update revision for qzhou models (#3069) * 1.38.46 Automatically generated by python-semantic-release * Fix the reference link for CoDi-Embedding-V1 (#3075) Fix reference link * rename passage to document * format --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: Mohammad Kalim Akram <kalim.akram@jina.ai> Co-authored-by: ItsukiFujii <42373615+ItsukiFujii@users.noreply.github.com> Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Paul Teiletche <73120933+paultltc@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: lsz05 <shengzhe.li@sbintuitions.co.jp> Co-authored-by: zhichao-aws <zhichaog@amazon.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Abdur-Rahman Butler <79828536+abdurrahmanbutler@users.noreply.github.com> Co-authored-by: Feiyang <feiyangc@google.com> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: semantic-release <semantic-release> Co-authored-by: Nikolay Banar <nikc20008@gmail.com> Co-authored-by: Penny Yu <51702222+PennyYu123@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com> Co-authored-by: fzowl <zoltan@voyageai.com> Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com> Co-authored-by: Kritias <50093609+ElPlaguister@users.noreply.github.com> Co-authored-by: roipony <roipony@gmail.com> Co-authored-by: Aashka Trivedi <aashka.trivedi@gmail.com> Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com> Co-authored-by: admin <bo.wang@jina.ai> Co-authored-by: Maximilian Werk <maximilian.werk@gmx.de> Co-authored-by: Victor <zbwkeepgoing@126.com> Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>

* model: add image support for jina embeddings v4 (#2893) * feat: unify text and image embeddings for all tasks * fix: uniform batch size * fix: update error message * fix: update code task * fix: update max length * fix: apply review suggestions * model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889) * feat: add KaLM_Embedding_X_0605 in kalm_models * Update kalm_models.py for lint format * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 --------- Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> * Add Classification Evaluator unit test (#2838) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: update colpali engine models (#2905) * adding vidore benchmarks * fix typo * clean vidore names + per lang eval * lint * vidore names * bibtex fix * fix revision * vidore v2 citation * update citation format and fix per-language mappings * lint: citations * typo citations * fix revisiions * lint * fix colnomic3b revision * fix colqwen2.5 revision + latest repo version * fix query agmentation tokens * colsmol revision * 1.38.35 Automatically generated by python-semantic-release * Evaluator tests (#2910) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Adding STSEvaluator and SummarizationEvaluator tests * Correcting due to the comments * Correcting due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Classification dataset cleaning (#2900) * Classification dataset cleaning * Update pull request number * Fix metadata test * fix formatting * add script for cleaning * Update tasks & benchmarks tables * dataset: Add JapaneseSentimentClassification (#2913) Add JapaneseSentimentClassification * Update tasks & benchmarks tables * fix: change `passage` prompt to `document` (#2912) * change document to passage * fix prompt names * fix kwargs check * fix default prompt * 1.38.36 Automatically generated by python-semantic-release * model: Add OpenSearch inf-free sparse encoding models (#2903) add opensearch inf-free models Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * dataset: add BarExamQA dataset (#2916) * Add BareExamQA retrieval task * ran linter * updated details * updated details * fixed subtype name * fixed changes * ran linter again * Use `mteb.get_model` in adding_a_dataset.md (#2922) Update adding_a_dataset.md * fix: specify revision for opensearch (#2919) specify revision for opensearch * 1.38.37 Automatically generated by python-semantic-release * Update the link for gemini-embedding-001 (#2928) * fix: replace with passage (#2934) * fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940) * fix: Only import SparseEncoder once sentence-transformer version have been checked fixes #2936 * Update mteb/models/opensearch_neural_sparse_models.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939) The leaderboard would have (silent) errors where `get_benchmark` lead to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue. * docs: Update adding_a_dataset.md (#2947) * docs: Update adding_a_dataset.md * Update docs/adding_a_dataset.md * ci: bump semantic release * 1.38.38 Automatically generated by python-semantic-release * dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935) * BSARD loader fixed * BSARDv2 metadata fixed * Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tasks & benchmarks tables * dataset: add GovReport dataset (#2953) * Added govreport task * Updated description * dataset: add BillSum datasets (#2943) * Added BillSum datasets * fixed billsumca * Updated BillSumCA description * Updated BillSumUS description * Update mteb/tasks/Retrieval/eng/BillSumCA.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/BillSumUS.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * lint * lint --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716) * Add RuSciBench * fix bitext mining lang * Add regression task * fix init * add missing files * Improve description * Add superseded_by * fix lint * Update regression task to match with v2 * Add stratified_subsampling for regression task * Add boostrap for regression task * Rename task class, add model as evaluator argument * fix import * fix import 2 * fixes * fix * Rename regression model protocol * Update tasks & benchmarks tables * 1.38.39 Automatically generated by python-semantic-release * qzhou-embedding model_meta & implementation (#2975) * qzhou-embedding model_meta & implementation * Update qzhou_models.py * Update qzhou_models.py Processing todo items（Add default instruction） * Update qzhou_models.py correct bge datalist * Update qzhou_models.py correct 'public_training_data' * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update mteb/models/qzhou_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/qzhou_models.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * format qzhou_models.py for ruff check --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * model: Add Voyage 3.5 model configuration (#3005) Add Voyage 3.5 model configuration - Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens - Set release date to 2025-01-21 with revision 1 - Configure for cosine similarity with instruction support - Include standard Voyage training datasets reference 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Claude <noreply@anthropic.com> * model: BAAI/bge-m3-unsupervised Model (#3007) * Add BAAI/bge-m3-unsupervised Model (BAAI/bge_m3_retromae is commented out - the details are proper, but it fails during loading the model for me, so i commented out) * Remove the commented retromae model --------- Co-authored-by: fzowl <zoltan@voyageai.com> * lint: Correcting lint errors (#3004) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Correcting the lint errors --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * dataset: Added 50 Vietnamese dataset from vn-mteb (#2964) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * model: Add Cohere embed-v4.0 model support (#3006) * Add Cohere embed-v4.0 model support - Add text-only embed-v4.0 model in cohere_models.py - Add multimodal embed-v4.0 model in cohere_v.py - Support configurable dimensions (256, 512, 1024, 1536) - Support 128,000 token context length - Support multimodal embedding (text, images, mixed PDFs) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * Add Cohere embed-v4.0 model support Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * Add OpenAI models with 512 dimension (#3008) * Add OpenAI/text-embedding-3-small (512 dim) Add OpenAI/text-embedding-3-large (512 dim) * Correcting due to comments --------- Co-authored-by: fzowl <zoltan@voyageai.com> * Standardise task names and fix citation formatting (#3026) fixes for name formatting * Update tasks & benchmarks tables * fix: Add missing training sets for qzhou (#3023) * Supplement missing training sets * reformat code * Reorganize the data list format * update qzhou_model meta * 1.38.40 Automatically generated by python-semantic-release * model: Add samilpwc_models meta (#3028) * model: Add samilpwc_models meta * Fix: Remove CONST * Fix: Reformat File * Update: model revision * model: Add granite-vision-embedding model (#3029) * Add files via upload * Address review comments * Address review comments * ruff format * Update mteb/models/granite_vision_embedding_models.py * lint error fix --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: incorrect revision for SNLRetrieval (#3033) The provided revisions doesn't seem to be present on: adrlau/navjordj-SNL_summarization_copy Replacing with latest revision * dataset: Add HumanEvalRetrieval task (#3022) * Add HumanEvalRetrieval dataset * Fix TaskMetadata structure and remove descriptive_stats - Use TaskMetadata class instead of dict - Remove descriptive_stats as requested in PR review - Add date field and proper import structure * Fix dataset path and use verified metadata - Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval - Use actual description from HuggingFace dataset page - Remove fabricated citation and reference - Remove revision field that was incorrect - Reference HuggingFace dataset page instead of arxiv * Add correct revision hash to HumanEval - Add revision hash: ed1f48a for reproducibility * Fix HumanEval metadata validation - Add date field for metadata completeness - Add bibtex_citation field (empty string) - Required for TaskMetadata validation to pass - Should resolve PR test failure * Address reviewer feedback - Remove trust_remote_code parameter as requested - Add revision parameter to load_dataset() calls for consistency - Use metadata revision hash in dataset loading for reproducibility * Fix field names in HumanEval dataset loading Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format. * Fix deprecated metadata_dict usage Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility. * Fix data structure for MTEB compatibility - Organize data by splits as expected by MTEB retrieval tasks - Convert scores to integers for pytrec_eval compatibility * Address PR feedback for HumanEval dataset - Add descriptive statistics using calculate_metadata_metrics() - Enhance metadata description with dataset structure details - Add complete BibTeX citation for original paper - Update to full commit hash revision - Add python-Code language tag for programming language - Explain retrieval task formulation clearly * Fix BibTeX citation formatting for HumanEvalRetrieval - Update citation to match bibtexparser formatting requirements - Fields now in alphabetical order with lowercase names - Proper trailing commas and indentation * Update tasks & benchmarks tables * 1.38.41 Automatically generated by python-semantic-release * ci: reduce parallel runs for when checking if a dataset exists (#3035) The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831) * ci: Updating rerun delays to prevent false positives errors * ci: Updating rerun delays to prevent false positives errors * model: Add GreenNode Vietnamese Embedding models (#2994) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * model: add granite-embedding-english R2 models (#3050) * fix: Updated revision for jina-embeddings-v4 (#3046) * fix: jinav4 revision Signed-off-by: admin <bo.wang@jina.ai> * change revision instead of removing it Signed-off-by: admin <bo.wang@jina.ai> --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: admin <bo.wang@jina.ai> * 1.38.42 Automatically generated by python-semantic-release * Fix 3 VN-MTEB Pair Classification tasks (#3053) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * [FIX] VN-MTEB 3 datasets PairClassification rename column * dataset: Add mbpp retrieval (#3037) * Add MBPP retrieval task - Code retrieval task based on 378 Python programming problems - Natural language queries matched to Python code implementations - Uses python-Code evaluation language for code-specific metrics - Includes proper citations and descriptive statistics * Add MBPPRetrieval to imports * Add descriptive statistics for MBPPRetrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * dataset: Added wikisql retrieval (#3039) * Add WikiSQL retrieval task - Code retrieval task based on WikiSQL natural language to SQL dataset - Natural language questions matched to SQL query implementations - Uses sql-Code evaluation language for SQL-specific metrics - Includes proper citations and descriptive statistics * Add WikiSQLRetrieval to imports * Add descriptive statistics for WikiSQLRetrieval * Reformatting * Reformatting * Reformatting, correcting the revision * Update tasks & benchmarks tables * ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors try to fix CI * fix MBPPRetrieval revision (#3055) Update MBPPRetrieval.py Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> * fix: Add VN-MTEB benchmark and Leaderboard (#2995) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] VN-MTEB benchmark and leaderboard * [FIX] wrong benchmark name * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * 1.38.43 Automatically generated by python-semantic-release * Add hc3finance retrieval (#3041) * Add HC3Finance retrieval task - Financial retrieval task based on HC3 Finance dataset - Financial questions matched to human and AI-generated content - Covers financial explanations, analysis, and educational content - Includes proper citations and descriptive statistics * Add HC3FinanceRetrieval to imports * Add descriptive statistics for HC3FinanceRetrieval * Reformatting * Reformatting, correcting the revision * Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Add finqa retrieval (#3042) * Add FinQA retrieval task - Financial numerical reasoning retrieval task based on FinQA dataset - Numerical financial questions matched to relevant document data - Covers earnings reports with tables and quantitative financial data - Includes proper citations and descriptive statistics * Add FinQARetrieval to imports * Add descriptive statistics for FinQARetrieval * Reformatting * Reformatting * Update mteb/tasks/Retrieval/eng/FinQARetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FinanceBenchRetrieval task (#3044) * Add FinanceBenchRetrieval * Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FreshStackRetrieval task (#3043) * Add FreshStackRetrieval * Reformatting, correcting the revision * Dataset correction * Update tasks & benchmarks tables * dataset: Add ds1000 retrieval (#3038) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * Add ChatDoctorRetrieval (#3045) * Add ChatDoctorRetrieval * Reformatting, correcting the revision * Correct the dataset citation * Correcting due to comments * Update tasks & benchmarks tables * Correcting the (new) DS1000 dataset's revision (#3063) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Add DS1000Retrieval task implementation * dataset: Add JinaVDR (#2942) * feat: added jinavdr benchmark * feat: added description for jinavdr * feat: fixed licenses and added bibtex * feat: made jinav4 compatible with vidore benchmark * feat: corrected query numbers * feat: removed print * feat: added max pixel argument for jina models * feat: score calculation on cpu * feat: adjust jina model for new mteb code * feat: code cleanup * feat: corrected bibtex * feat: make colpali run with jinavdr * feat: fixed comments * feat: better reference and fixed comments * feat: added date for tasks * feat: fixed missing metadata and bibtex * feat: added descriptions per dataset * Update tasks & benchmarks tables * model: Add CoDi-Embedding-V1 (#3054) * add codiemb-minicpm * replace codiemb_minicpm with codi_model * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * update code * update code * reformat --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: ensure that there are always relevant docs attached to query (#3058) * fix: ensure that there are always relevant docs attached to query Here is brief test that it doesn't influence scores: ```py t1 = mteb.get_task("TwitterHjerneRetrieval") meta = mteb.get_model_meta("minishlab/potion-base-2M") eval = mteb.MTEB(tasks=[t1]) res = eval.run(model=meta.load_model()) # before fix: res[0].get_score() # np.float64(0.02837) res[0].scores before_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # with update: res[0].get_score() # np.float64(0.02837) res[0].scores with_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # check with_fix == before_fix # True * restructure * format * relax pytrec versions * fix incorrect parsing * 1.38.44 Automatically generated by python-semantic-release * Correcting the JINA models with SentenceTransformerWrapper (#3071) * ci: Add stale workflow (#3066) * add stale workflow * add permissions * add bug label to bug issue template * revert bug issue and only look at more info needed issues * more accurate name * override default * fix: open_clip package validation (#3073) * 1.38.45 Automatically generated by python-semantic-release * fix: Update revision for qzhou models (#3069) * 1.38.46 Automatically generated by python-semantic-release * Fix the reference link for CoDi-Embedding-V1 (#3075) Fix reference link * fix: Add beta version of RTEB related benchmarks (#3048) * Add RTEB related benchmarks * Add RTEB related benchmarks * Correcting the task names in the RTEB benchmarks * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Adding the CURE dataset to RTEB benchmarks * Use the right language subset * Fix broken finance icon URL in RTEB benchmarks Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg Validated all icon URLs and confirmed accessibility compliance * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * 1.38.47 Automatically generated by python-semantic-release * fix: run `ruff check` on all files during ci (#3086) * fix: run `ruff check` on all files during ci * format * 1.38.48 Automatically generated by python-semantic-release * Move dev to dependency groups (#3088) add dependency groups * fix: Improving validate_task_to_prompt_name logs and error messages (#3079) * Improving validate_task_to_prompt_name logs and error messages * linter fixes * Adding None prompts tests * Update test_benchmark_sentence_transformer * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: duplicate mteb multilingual variables (#3080) * fix benchmark naming * format * lint * Update tasks & benchmarks tables * model: mdbr-leaf models (#3081) * added MDBR leaf models * fixed revision for mdbr-leaf-ir * added model prompts * updated training datasets * fixed linting * lotte task reference --------- Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> * 1.38.49 Automatically generated by python-semantic-release * CI: Set upper limit for xdist version (#3098) * Commentout bibtex formatting * Remove `-n auto` * get back bibtex * try limiting versions * revert coverage * revert coverage --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Combine Plots and Tables into a Single (#3047) * feat - Combine Plots and Tables into a Single Tab #3009 * feat - Resize the plot to make it more readable * feat - Remove the (radar chart) * feat - Add a comment stating that it only shows the Top 5 models in the table. * feat - adjust layout * Update mteb/leaderboard/app.py * format --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * mteb importable * format * fix model implementations * fix `validate_task_to_prompt_name` * align regression task with others * remove model overview * remove partials * format * fix tests * fix evaluators tests * add trust remote code to bsard * pre-commit run all files * add all descriptive stats * fix trust remote code test * add `RetrievalSplitData` to reranking --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: Mohammad Kalim Akram <kalim.akram@jina.ai> Co-authored-by: ItsukiFujii <42373615+ItsukiFujii@users.noreply.github.com> Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Paul Teiletche <73120933+paultltc@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: lsz05 <shengzhe.li@sbintuitions.co.jp> Co-authored-by: zhichao-aws <zhichaog@amazon.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Abdur-Rahman Butler <79828536+abdurrahmanbutler@users.noreply.github.com> Co-authored-by: Feiyang <feiyangc@google.com> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: semantic-release <semantic-release> Co-authored-by: Nikolay Banar <nikc20008@gmail.com> Co-authored-by: Penny Yu <51702222+PennyYu123@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com> Co-authored-by: fzowl <zoltan@voyageai.com> Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com> Co-authored-by: Kritias <50093609+ElPlaguister@users.noreply.github.com> Co-authored-by: roipony <roipony@gmail.com> Co-authored-by: Aashka Trivedi <aashka.trivedi@gmail.com> Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com> Co-authored-by: admin <bo.wang@jina.ai> Co-authored-by: Maximilian Werk <maximilian.werk@gmx.de> Co-authored-by: Victor <zbwkeepgoing@126.com> Co-authored-by: Yong woo Song <ywsong.dev@kakao.com> Co-authored-by: Ryan Mullins <ryan@ryanmullins.org> Co-authored-by: Robin Vujanic <robin-vjc@users.noreply.github.com> Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>

* model: add image support for jina embeddings v4 (#2893) * feat: unify text and image embeddings for all tasks * fix: uniform batch size * fix: update error message * fix: update code task * fix: update max length * fix: apply review suggestions * model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889) * feat: add KaLM_Embedding_X_0605 in kalm_models * Update kalm_models.py for lint format * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 --------- Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> * Add Classification Evaluator unit test (#2838) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: update colpali engine models (#2905) * adding vidore benchmarks * fix typo * clean vidore names + per lang eval * lint * vidore names * bibtex fix * fix revision * vidore v2 citation * update citation format and fix per-language mappings * lint: citations * typo citations * fix revisiions * lint * fix colnomic3b revision * fix colqwen2.5 revision + latest repo version * fix query agmentation tokens * colsmol revision * 1.38.35 Automatically generated by python-semantic-release * Evaluator tests (#2910) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Adding STSEvaluator and SummarizationEvaluator tests * Correcting due to the comments * Correcting due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Classification dataset cleaning (#2900) * Classification dataset cleaning * Update pull request number * Fix metadata test * fix formatting * add script for cleaning * Update tasks & benchmarks tables * dataset: Add JapaneseSentimentClassification (#2913) Add JapaneseSentimentClassification * Update tasks & benchmarks tables * fix: change `passage` prompt to `document` (#2912) * change document to passage * fix prompt names * fix kwargs check * fix default prompt * 1.38.36 Automatically generated by python-semantic-release * model: Add OpenSearch inf-free sparse encoding models (#2903) add opensearch inf-free models Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * dataset: add BarExamQA dataset (#2916) * Add BareExamQA retrieval task * ran linter * updated details * updated details * fixed subtype name * fixed changes * ran linter again * Use `mteb.get_model` in adding_a_dataset.md (#2922) Update adding_a_dataset.md * fix: specify revision for opensearch (#2919) specify revision for opensearch * 1.38.37 Automatically generated by python-semantic-release * Update the link for gemini-embedding-001 (#2928) * fix: replace with passage (#2934) * fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940) * fix: Only import SparseEncoder once sentence-transformer version have been checked fixes #2936 * Update mteb/models/opensearch_neural_sparse_models.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939) The leaderboard would have (silent) errors where `get_benchmark` lead to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue. * docs: Update adding_a_dataset.md (#2947) * docs: Update adding_a_dataset.md * Update docs/adding_a_dataset.md * ci: bump semantic release * 1.38.38 Automatically generated by python-semantic-release * dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935) * BSARD loader fixed * BSARDv2 metadata fixed * Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tasks & benchmarks tables * dataset: add GovReport dataset (#2953) * Added govreport task * Updated description * dataset: add BillSum datasets (#2943) * Added BillSum datasets * fixed billsumca * Updated BillSumCA description * Updated BillSumUS description * Update mteb/tasks/Retrieval/eng/BillSumCA.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/BillSumUS.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * lint * lint --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716) * Add RuSciBench * fix bitext mining lang * Add regression task * fix init * add missing files * Improve description * Add superseded_by * fix lint * Update regression task to match with v2 * Add stratified_subsampling for regression task * Add boostrap for regression task * Rename task class, add model as evaluator argument * fix import * fix import 2 * fixes * fix * Rename regression model protocol * Update tasks & benchmarks tables * 1.38.39 Automatically generated by python-semantic-release * qzhou-embedding model_meta & implementation (#2975) * qzhou-embedding model_meta & implementation * Update qzhou_models.py * Update qzhou_models.py Processing todo items（Add default instruction） * Update qzhou_models.py correct bge datalist * Update qzhou_models.py correct 'public_training_data' * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update mteb/models/qzhou_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/qzhou_models.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * format qzhou_models.py for ruff check --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * model: Add Voyage 3.5 model configuration (#3005) Add Voyage 3.5 model configuration - Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens - Set release date to 2025-01-21 with revision 1 - Configure for cosine similarity with instruction support - Include standard Voyage training datasets reference 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Claude <noreply@anthropic.com> * model: BAAI/bge-m3-unsupervised Model (#3007) * Add BAAI/bge-m3-unsupervised Model (BAAI/bge_m3_retromae is commented out - the details are proper, but it fails during loading the model for me, so i commented out) * Remove the commented retromae model --------- Co-authored-by: fzowl <zoltan@voyageai.com> * lint: Correcting lint errors (#3004) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Correcting the lint errors --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * dataset: Added 50 Vietnamese dataset from vn-mteb (#2964) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * model: Add Cohere embed-v4.0 model support (#3006) * Add Cohere embed-v4.0 model support - Add text-only embed-v4.0 model in cohere_models.py - Add multimodal embed-v4.0 model in cohere_v.py - Support configurable dimensions (256, 512, 1024, 1536) - Support 128,000 token context length - Support multimodal embedding (text, images, mixed PDFs) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * Add Cohere embed-v4.0 model support Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * Add OpenAI models with 512 dimension (#3008) * Add OpenAI/text-embedding-3-small (512 dim) Add OpenAI/text-embedding-3-large (512 dim) * Correcting due to comments --------- Co-authored-by: fzowl <zoltan@voyageai.com> * Standardise task names and fix citation formatting (#3026) fixes for name formatting * Update tasks & benchmarks tables * fix: Add missing training sets for qzhou (#3023) * Supplement missing training sets * reformat code * Reorganize the data list format * update qzhou_model meta * 1.38.40 Automatically generated by python-semantic-release * model: Add samilpwc_models meta (#3028) * model: Add samilpwc_models meta * Fix: Remove CONST * Fix: Reformat File * Update: model revision * model: Add granite-vision-embedding model (#3029) * Add files via upload * Address review comments * Address review comments * ruff format * Update mteb/models/granite_vision_embedding_models.py * lint error fix --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: incorrect revision for SNLRetrieval (#3033) The provided revisions doesn't seem to be present on: adrlau/navjordj-SNL_summarization_copy Replacing with latest revision * dataset: Add HumanEvalRetrieval task (#3022) * Add HumanEvalRetrieval dataset * Fix TaskMetadata structure and remove descriptive_stats - Use TaskMetadata class instead of dict - Remove descriptive_stats as requested in PR review - Add date field and proper import structure * Fix dataset path and use verified metadata - Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval - Use actual description from HuggingFace dataset page - Remove fabricated citation and reference - Remove revision field that was incorrect - Reference HuggingFace dataset page instead of arxiv * Add correct revision hash to HumanEval - Add revision hash: ed1f48a for reproducibility * Fix HumanEval metadata validation - Add date field for metadata completeness - Add bibtex_citation field (empty string) - Required for TaskMetadata validation to pass - Should resolve PR test failure * Address reviewer feedback - Remove trust_remote_code parameter as requested - Add revision parameter to load_dataset() calls for consistency - Use metadata revision hash in dataset loading for reproducibility * Fix field names in HumanEval dataset loading Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format. * Fix deprecated metadata_dict usage Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility. * Fix data structure for MTEB compatibility - Organize data by splits as expected by MTEB retrieval tasks - Convert scores to integers for pytrec_eval compatibility * Address PR feedback for HumanEval dataset - Add descriptive statistics using calculate_metadata_metrics() - Enhance metadata description with dataset structure details - Add complete BibTeX citation for original paper - Update to full commit hash revision - Add python-Code language tag for programming language - Explain retrieval task formulation clearly * Fix BibTeX citation formatting for HumanEvalRetrieval - Update citation to match bibtexparser formatting requirements - Fields now in alphabetical order with lowercase names - Proper trailing commas and indentation * Update tasks & benchmarks tables * 1.38.41 Automatically generated by python-semantic-release * ci: reduce parallel runs for when checking if a dataset exists (#3035) The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831) * ci: Updating rerun delays to prevent false positives errors * ci: Updating rerun delays to prevent false positives errors * model: Add GreenNode Vietnamese Embedding models (#2994) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * model: add granite-embedding-english R2 models (#3050) * fix: Updated revision for jina-embeddings-v4 (#3046) * fix: jinav4 revision Signed-off-by: admin <bo.wang@jina.ai> * change revision instead of removing it Signed-off-by: admin <bo.wang@jina.ai> --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: admin <bo.wang@jina.ai> * 1.38.42 Automatically generated by python-semantic-release * Fix 3 VN-MTEB Pair Classification tasks (#3053) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * [FIX] VN-MTEB 3 datasets PairClassification rename column * dataset: Add mbpp retrieval (#3037) * Add MBPP retrieval task - Code retrieval task based on 378 Python programming problems - Natural language queries matched to Python code implementations - Uses python-Code evaluation language for code-specific metrics - Includes proper citations and descriptive statistics * Add MBPPRetrieval to imports * Add descriptive statistics for MBPPRetrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * dataset: Added wikisql retrieval (#3039) * Add WikiSQL retrieval task - Code retrieval task based on WikiSQL natural language to SQL dataset - Natural language questions matched to SQL query implementations - Uses sql-Code evaluation language for SQL-specific metrics - Includes proper citations and descriptive statistics * Add WikiSQLRetrieval to imports * Add descriptive statistics for WikiSQLRetrieval * Reformatting * Reformatting * Reformatting, correcting the revision * Update tasks & benchmarks tables * ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors try to fix CI * fix MBPPRetrieval revision (#3055) Update MBPPRetrieval.py Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> * fix: Add VN-MTEB benchmark and Leaderboard (#2995) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] VN-MTEB benchmark and leaderboard * [FIX] wrong benchmark name * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * 1.38.43 Automatically generated by python-semantic-release * Add hc3finance retrieval (#3041) * Add HC3Finance retrieval task - Financial retrieval task based on HC3 Finance dataset - Financial questions matched to human and AI-generated content - Covers financial explanations, analysis, and educational content - Includes proper citations and descriptive statistics * Add HC3FinanceRetrieval to imports * Add descriptive statistics for HC3FinanceRetrieval * Reformatting * Reformatting, correcting the revision * Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Add finqa retrieval (#3042) * Add FinQA retrieval task - Financial numerical reasoning retrieval task based on FinQA dataset - Numerical financial questions matched to relevant document data - Covers earnings reports with tables and quantitative financial data - Includes proper citations and descriptive statistics * Add FinQARetrieval to imports * Add descriptive statistics for FinQARetrieval * Reformatting * Reformatting * Update mteb/tasks/Retrieval/eng/FinQARetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FinanceBenchRetrieval task (#3044) * Add FinanceBenchRetrieval * Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FreshStackRetrieval task (#3043) * Add FreshStackRetrieval * Reformatting, correcting the revision * Dataset correction * Update tasks & benchmarks tables * dataset: Add ds1000 retrieval (#3038) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * Add ChatDoctorRetrieval (#3045) * Add ChatDoctorRetrieval * Reformatting, correcting the revision * Correct the dataset citation * Correcting due to comments * Update tasks & benchmarks tables * Correcting the (new) DS1000 dataset's revision (#3063) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Add DS1000Retrieval task implementation * dataset: Add JinaVDR (#2942) * feat: added jinavdr benchmark * feat: added description for jinavdr * feat: fixed licenses and added bibtex * feat: made jinav4 compatible with vidore benchmark * feat: corrected query numbers * feat: removed print * feat: added max pixel argument for jina models * feat: score calculation on cpu * feat: adjust jina model for new mteb code * feat: code cleanup * feat: corrected bibtex * feat: make colpali run with jinavdr * feat: fixed comments * feat: better reference and fixed comments * feat: added date for tasks * feat: fixed missing metadata and bibtex * feat: added descriptions per dataset * Update tasks & benchmarks tables * model: Add CoDi-Embedding-V1 (#3054) * add codiemb-minicpm * replace codiemb_minicpm with codi_model * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * update code * update code * reformat --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: ensure that there are always relevant docs attached to query (#3058) * fix: ensure that there are always relevant docs attached to query Here is brief test that it doesn't influence scores: ```py t1 = mteb.get_task("TwitterHjerneRetrieval") meta = mteb.get_model_meta("minishlab/potion-base-2M") eval = mteb.MTEB(tasks=[t1]) res = eval.run(model=meta.load_model()) # before fix: res[0].get_score() # np.float64(0.02837) res[0].scores before_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # with update: res[0].get_score() # np.float64(0.02837) res[0].scores with_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # check with_fix == before_fix # True * restructure * format * relax pytrec versions * fix incorrect parsing * 1.38.44 Automatically generated by python-semantic-release * Correcting the JINA models with SentenceTransformerWrapper (#3071) * ci: Add stale workflow (#3066) * add stale workflow * add permissions * add bug label to bug issue template * revert bug issue and only look at more info needed issues * more accurate name * override default * fix: open_clip package validation (#3073) * 1.38.45 Automatically generated by python-semantic-release * fix: Update revision for qzhou models (#3069) * 1.38.46 Automatically generated by python-semantic-release * Fix the reference link for CoDi-Embedding-V1 (#3075) Fix reference link * fix: Add beta version of RTEB related benchmarks (#3048) * Add RTEB related benchmarks * Add RTEB related benchmarks * Correcting the task names in the RTEB benchmarks * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Adding the CURE dataset to RTEB benchmarks * Use the right language subset * Fix broken finance icon URL in RTEB benchmarks Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg Validated all icon URLs and confirmed accessibility compliance * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * 1.38.47 Automatically generated by python-semantic-release * fix: run `ruff check` on all files during ci (#3086) * fix: run `ruff check` on all files during ci * format * 1.38.48 Automatically generated by python-semantic-release * Move dev to dependency groups (#3088) add dependency groups * fix: Improving validate_task_to_prompt_name logs and error messages (#3079) * Improving validate_task_to_prompt_name logs and error messages * linter fixes * Adding None prompts tests * Update test_benchmark_sentence_transformer * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: duplicate mteb multilingual variables (#3080) * fix benchmark naming * format * lint * Update tasks & benchmarks tables * model: mdbr-leaf models (#3081) * added MDBR leaf models * fixed revision for mdbr-leaf-ir * added model prompts * updated training datasets * fixed linting * lotte task reference --------- Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> * 1.38.49 Automatically generated by python-semantic-release * CI: Set upper limit for xdist version (#3098) * Commentout bibtex formatting * Remove `-n auto` * get back bibtex * try limiting versions * revert coverage * revert coverage --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Combine Plots and Tables into a Single (#3047) * feat - Combine Plots and Tables into a Single Tab #3009 * feat - Resize the plot to make it more readable * feat - Remove the (radar chart) * feat - Add a comment stating that it only shows the Top 5 models in the table. * feat - adjust layout * Update mteb/leaderboard/app.py * format --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Updating the default batch size calculation in the voyage models (#3091) * 1.38.50 Automatically generated by python-semantic-release * fix: Add @classmethod for @field_validators in TaskMetadata (#3100) * Align task prompt dict with `PromptType` (#3101) * align task prompt dict with `PromptType` * use value instead of enum * 1.38.51 Automatically generated by python-semantic-release * model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090) * Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 * Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth) * Format with ruff + add loader per review * Apply ruff format/fixes * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader) * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix import * Add memory_usage_mb=808.0 and required fields to ModelMeta * Fix 210 milions of parameters --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: Allow closed datasets (#3059) * - Added an include_private parameter to the get_tasks() function that defaults to False - This ensures that by default, tests only run on public datasets - Tests can explicitly set include_private=True when needed to test private datasets - Added is_public: bool | None = None field to TaskMetadata - The field is optional and defaults to None (treated as public) - Updated the is_filled() method to exclude is_public from required fields - Added documentation * - Added an include_private parameter to the get_tasks() function that defaults to False - This ensures that by default, tests only run on public datasets - Tests can explicitly set include_private=True when needed to test private datasets - Added is_public: bool | None = None field to TaskMetadata - The field is optional and defaults to None (treated as public) - Updated the is_filled() method to exclude is_public from required fields - Added documentation * Correcting due to comments * Update mteb/abstasks/TaskMetadata.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/overview.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Removing the not used filter_tasks_by_privacy function * Correcting due to comments * Correcting due to comments * Correcting due to comments * Removing the test case * Rename the include_private parameter to exclude_private * Rename the include_private parameter to exclude_private * Add private tasks tests * Add private tasks tests * Update tests/test_tasks/test_private_tasks.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Add private tasks tests * Add private tasks tests * Add private tasks tests --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * 1.38.52 Automatically generated by python-semantic-release --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: Mohammad Kalim Akram <kalim.akram@jina.ai> Co-authored-by: ItsukiFujii <42373615+ItsukiFujii@users.noreply.github.com> Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Paul Teiletche <73120933+paultltc@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: lsz05 <shengzhe.li@sbintuitions.co.jp> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: zhichao-aws <zhichaog@amazon.com> Co-authored-by: Abdur-Rahman Butler <79828536+abdurrahmanbutler@users.noreply.github.com> Co-authored-by: Feiyang <feiyangc@google.com> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: semantic-release <semantic-release> Co-authored-by: Nikolay Banar <nikc20008@gmail.com> Co-authored-by: Penny Yu <51702222+PennyYu123@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com> Co-authored-by: fzowl <zoltan@voyageai.com> Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com> Co-authored-by: Kritias <50093609+ElPlaguister@users.noreply.github.com> Co-authored-by: roipony <roipony@gmail.com> Co-authored-by: Aashka Trivedi <aashka.trivedi@gmail.com> Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com> Co-authored-by: admin <bo.wang@jina.ai> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Maximilian Werk <maximilian.werk@gmx.de> Co-authored-by: Victor <zbwkeepgoing@126.com> Co-authored-by: Yong woo Song <ywsong.dev@kakao.com> Co-authored-by: Ryan Mullins <ryan@ryanmullins.org> Co-authored-by: Robin Vujanic <robin-vjc@users.noreply.github.com> Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com> Co-authored-by: mathlesage <134429083+mathlesage@users.noreply.github.com>

* model: add image support for jina embeddings v4 (#2893) * feat: unify text and image embeddings for all tasks * fix: uniform batch size * fix: update error message * fix: update code task * fix: update max length * fix: apply review suggestions * model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889) * feat: add KaLM_Embedding_X_0605 in kalm_models * Update kalm_models.py for lint format * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 --------- Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> * Add Classification Evaluator unit test (#2838) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: update colpali engine models (#2905) * adding vidore benchmarks * fix typo * clean vidore names + per lang eval * lint * vidore names * bibtex fix * fix revision * vidore v2 citation * update citation format and fix per-language mappings * lint: citations * typo citations * fix revisiions * lint * fix colnomic3b revision * fix colqwen2.5 revision + latest repo version * fix query agmentation tokens * colsmol revision * 1.38.35 Automatically generated by python-semantic-release * Evaluator tests (#2910) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Adding STSEvaluator and SummarizationEvaluator tests * Correcting due to the comments * Correcting due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Classification dataset cleaning (#2900) * Classification dataset cleaning * Update pull request number * Fix metadata test * fix formatting * add script for cleaning * Update tasks & benchmarks tables * dataset: Add JapaneseSentimentClassification (#2913) Add JapaneseSentimentClassification * Update tasks & benchmarks tables * fix: change `passage` prompt to `document` (#2912) * change document to passage * fix prompt names * fix kwargs check * fix default prompt * 1.38.36 Automatically generated by python-semantic-release * model: Add OpenSearch inf-free sparse encoding models (#2903) add opensearch inf-free models Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * dataset: add BarExamQA dataset (#2916) * Add BareExamQA retrieval task * ran linter * updated details * updated details * fixed subtype name * fixed changes * ran linter again * Use `mteb.get_model` in adding_a_dataset.md (#2922) Update adding_a_dataset.md * fix: specify revision for opensearch (#2919) specify revision for opensearch * 1.38.37 Automatically generated by python-semantic-release * Update the link for gemini-embedding-001 (#2928) * fix: replace with passage (#2934) * fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940) * fix: Only import SparseEncoder once sentence-transformer version have been checked fixes #2936 * Update mteb/models/opensearch_neural_sparse_models.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939) The leaderboard would have (silent) errors where `get_benchmark` lead to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue. * docs: Update adding_a_dataset.md (#2947) * docs: Update adding_a_dataset.md * Update docs/adding_a_dataset.md * ci: bump semantic release * 1.38.38 Automatically generated by python-semantic-release * dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935) * BSARD loader fixed * BSARDv2 metadata fixed * Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tasks & benchmarks tables * dataset: add GovReport dataset (#2953) * Added govreport task * Updated description * dataset: add BillSum datasets (#2943) * Added BillSum datasets * fixed billsumca * Updated BillSumCA description * Updated BillSumUS description * Update mteb/tasks/Retrieval/eng/BillSumCA.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/BillSumUS.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * lint * lint --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716) * Add RuSciBench * fix bitext mining lang * Add regression task * fix init * add missing files * Improve description * Add superseded_by * fix lint * Update regression task to match with v2 * Add stratified_subsampling for regression task * Add boostrap for regression task * Rename task class, add model as evaluator argument * fix import * fix import 2 * fixes * fix * Rename regression model protocol * Update tasks & benchmarks tables * 1.38.39 Automatically generated by python-semantic-release * qzhou-embedding model_meta & implementation (#2975) * qzhou-embedding model_meta & implementation * Update qzhou_models.py * Update qzhou_models.py Processing todo items（Add default instruction） * Update qzhou_models.py correct bge datalist * Update qzhou_models.py correct 'public_training_data' * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update mteb/models/qzhou_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/qzhou_models.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * format qzhou_models.py for ruff check --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * model: Add Voyage 3.5 model configuration (#3005) Add Voyage 3.5 model configuration - Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens - Set release date to 2025-01-21 with revision 1 - Configure for cosine similarity with instruction support - Include standard Voyage training datasets reference 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Claude <noreply@anthropic.com> * model: BAAI/bge-m3-unsupervised Model (#3007) * Add BAAI/bge-m3-unsupervised Model (BAAI/bge_m3_retromae is commented out - the details are proper, but it fails during loading the model for me, so i commented out) * Remove the commented retromae model --------- Co-authored-by: fzowl <zoltan@voyageai.com> * lint: Correcting lint errors (#3004) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Correcting the lint errors --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * dataset: Added 50 Vietnamese dataset from vn-mteb (#2964) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * model: Add Cohere embed-v4.0 model support (#3006) * Add Cohere embed-v4.0 model support - Add text-only embed-v4.0 model in cohere_models.py - Add multimodal embed-v4.0 model in cohere_v.py - Support configurable dimensions (256, 512, 1024, 1536) - Support 128,000 token context length - Support multimodal embedding (text, images, mixed PDFs) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * Add Cohere embed-v4.0 model support Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * Add OpenAI models with 512 dimension (#3008) * Add OpenAI/text-embedding-3-small (512 dim) Add OpenAI/text-embedding-3-large (512 dim) * Correcting due to comments --------- Co-authored-by: fzowl <zoltan@voyageai.com> * Standardise task names and fix citation formatting (#3026) fixes for name formatting * Update tasks & benchmarks tables * fix: Add missing training sets for qzhou (#3023) * Supplement missing training sets * reformat code * Reorganize the data list format * update qzhou_model meta * 1.38.40 Automatically generated by python-semantic-release * model: Add samilpwc_models meta (#3028) * model: Add samilpwc_models meta * Fix: Remove CONST * Fix: Reformat File * Update: model revision * model: Add granite-vision-embedding model (#3029) * Add files via upload * Address review comments * Address review comments * ruff format * Update mteb/models/granite_vision_embedding_models.py * lint error fix --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: incorrect revision for SNLRetrieval (#3033) The provided revisions doesn't seem to be present on: adrlau/navjordj-SNL_summarization_copy Replacing with latest revision * dataset: Add HumanEvalRetrieval task (#3022) * Add HumanEvalRetrieval dataset * Fix TaskMetadata structure and remove descriptive_stats - Use TaskMetadata class instead of dict - Remove descriptive_stats as requested in PR review - Add date field and proper import structure * Fix dataset path and use verified metadata - Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval - Use actual description from HuggingFace dataset page - Remove fabricated citation and reference - Remove revision field that was incorrect - Reference HuggingFace dataset page instead of arxiv * Add correct revision hash to HumanEval - Add revision hash: ed1f48a for reproducibility * Fix HumanEval metadata validation - Add date field for metadata completeness - Add bibtex_citation field (empty string) - Required for TaskMetadata validation to pass - Should resolve PR test failure * Address reviewer feedback - Remove trust_remote_code parameter as requested - Add revision parameter to load_dataset() calls for consistency - Use metadata revision hash in dataset loading for reproducibility * Fix field names in HumanEval dataset loading Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format. * Fix deprecated metadata_dict usage Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility. * Fix data structure for MTEB compatibility - Organize data by splits as expected by MTEB retrieval tasks - Convert scores to integers for pytrec_eval compatibility * Address PR feedback for HumanEval dataset - Add descriptive statistics using calculate_metadata_metrics() - Enhance metadata description with dataset structure details - Add complete BibTeX citation for original paper - Update to full commit hash revision - Add python-Code language tag for programming language - Explain retrieval task formulation clearly * Fix BibTeX citation formatting for HumanEvalRetrieval - Update citation to match bibtexparser formatting requirements - Fields now in alphabetical order with lowercase names - Proper trailing commas and indentation * Update tasks & benchmarks tables * 1.38.41 Automatically generated by python-semantic-release * ci: reduce parallel runs for when checking if a dataset exists (#3035) The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831) * ci: Updating rerun delays to prevent false positives errors * ci: Updating rerun delays to prevent false positives errors * model: Add GreenNode Vietnamese Embedding models (#2994) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * model: add granite-embedding-english R2 models (#3050) * fix: Updated revision for jina-embeddings-v4 (#3046) * fix: jinav4 revision Signed-off-by: admin <bo.wang@jina.ai> * change revision instead of removing it Signed-off-by: admin <bo.wang@jina.ai> --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: admin <bo.wang@jina.ai> * 1.38.42 Automatically generated by python-semantic-release * Fix 3 VN-MTEB Pair Classification tasks (#3053) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * [FIX] VN-MTEB 3 datasets PairClassification rename column * dataset: Add mbpp retrieval (#3037) * Add MBPP retrieval task - Code retrieval task based on 378 Python programming problems - Natural language queries matched to Python code implementations - Uses python-Code evaluation language for code-specific metrics - Includes proper citations and descriptive statistics * Add MBPPRetrieval to imports * Add descriptive statistics for MBPPRetrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * dataset: Added wikisql retrieval (#3039) * Add WikiSQL retrieval task - Code retrieval task based on WikiSQL natural language to SQL dataset - Natural language questions matched to SQL query implementations - Uses sql-Code evaluation language for SQL-specific metrics - Includes proper citations and descriptive statistics * Add WikiSQLRetrieval to imports * Add descriptive statistics for WikiSQLRetrieval * Reformatting * Reformatting * Reformatting, correcting the revision * Update tasks & benchmarks tables * ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors try to fix CI * fix MBPPRetrieval revision (#3055) Update MBPPRetrieval.py Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> * fix: Add VN-MTEB benchmark and Leaderboard (#2995) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] VN-MTEB benchmark and leaderboard * [FIX] wrong benchmark name * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * 1.38.43 Automatically generated by python-semantic-release * Add hc3finance retrieval (#3041) * Add HC3Finance retrieval task - Financial retrieval task based on HC3 Finance dataset - Financial questions matched to human and AI-generated content - Covers financial explanations, analysis, and educational content - Includes proper citations and descriptive statistics * Add HC3FinanceRetrieval to imports * Add descriptive statistics for HC3FinanceRetrieval * Reformatting * Reformatting, correcting the revision * Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Add finqa retrieval (#3042) * Add FinQA retrieval task - Financial numerical reasoning retrieval task based on FinQA dataset - Numerical financial questions matched to relevant document data - Covers earnings reports with tables and quantitative financial data - Includes proper citations and descriptive statistics * Add FinQARetrieval to imports * Add descriptive statistics for FinQARetrieval * Reformatting * Reformatting * Update mteb/tasks/Retrieval/eng/FinQARetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FinanceBenchRetrieval task (#3044) * Add FinanceBenchRetrieval * Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FreshStackRetrieval task (#3043) * Add FreshStackRetrieval * Reformatting, correcting the revision * Dataset correction * Update tasks & benchmarks tables * dataset: Add ds1000 retrieval (#3038) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * Add ChatDoctorRetrieval (#3045) * Add ChatDoctorRetrieval * Reformatting, correcting the revision * Correct the dataset citation * Correcting due to comments * Update tasks & benchmarks tables * Correcting the (new) DS1000 dataset's revision (#3063) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Add DS1000Retrieval task implementation * dataset: Add JinaVDR (#2942) * feat: added jinavdr benchmark * feat: added description for jinavdr * feat: fixed licenses and added bibtex * feat: made jinav4 compatible with vidore benchmark * feat: corrected query numbers * feat: removed print * feat: added max pixel argument for jina models * feat: score calculation on cpu * feat: adjust jina model for new mteb code * feat: code cleanup * feat: corrected bibtex * feat: make colpali run with jinavdr * feat: fixed comments * feat: better reference and fixed comments * feat: added date for tasks * feat: fixed missing metadata and bibtex * feat: added descriptions per dataset * Update tasks & benchmarks tables * model: Add CoDi-Embedding-V1 (#3054) * add codiemb-minicpm * replace codiemb_minicpm with codi_model * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * update code * update code * reformat --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: ensure that there are always relevant docs attached to query (#3058) * fix: ensure that there are always relevant docs attached to query Here is brief test that it doesn't influence scores: ```py t1 = mteb.get_task("TwitterHjerneRetrieval") meta = mteb.get_model_meta("minishlab/potion-base-2M") eval = mteb.MTEB(tasks=[t1]) res = eval.run(model=meta.load_model()) # before fix: res[0].get_score() # np.float64(0.02837) res[0].scores before_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # with update: res[0].get_score() # np.float64(0.02837) res[0].scores with_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # check with_fix == before_fix # True * restructure * format * relax pytrec versions * fix incorrect parsing * 1.38.44 Automatically generated by python-semantic-release * Correcting the JINA models with SentenceTransformerWrapper (#3071) * ci: Add stale workflow (#3066) * add stale workflow * add permissions * add bug label to bug issue template * revert bug issue and only look at more info needed issues * more accurate name * override default * fix: open_clip package validation (#3073) * 1.38.45 Automatically generated by python-semantic-release * fix: Update revision for qzhou models (#3069) * 1.38.46 Automatically generated by python-semantic-release * Fix the reference link for CoDi-Embedding-V1 (#3075) Fix reference link * fix: Add beta version of RTEB related benchmarks (#3048) * Add RTEB related benchmarks * Add RTEB related benchmarks * Correcting the task names in the RTEB benchmarks * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Adding the CURE dataset to RTEB benchmarks * Use the right language subset * Fix broken finance icon URL in RTEB benchmarks Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg Validated all icon URLs and confirmed accessibility compliance * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * 1.38.47 Automatically generated by python-semantic-release * fix: run `ruff check` on all files during ci (#3086) * fix: run `ruff check` on all files during ci * format * 1.38.48 Automatically generated by python-semantic-release * Move dev to dependency groups (#3088) add dependency groups * fix: Improving validate_task_to_prompt_name logs and error messages (#3079) * Improving validate_task_to_prompt_name logs and error messages * linter fixes * Adding None prompts tests * Update test_benchmark_sentence_transformer * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: duplicate mteb multilingual variables (#3080) * fix benchmark naming * format * lint * Update tasks & benchmarks tables * model: mdbr-leaf models (#3081) * added MDBR leaf models * fixed revision for mdbr-leaf-ir * added model prompts * updated training datasets * fixed linting * lotte task reference --------- Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> * 1.38.49 Automatically generated by python-semantic-release * CI: Set upper limit for xdist version (#3098) * Commentout bibtex formatting * Remove `-n auto` * get back bibtex * try limiting versions * revert coverage * revert coverage --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Combine Plots and Tables into a Single (#3047) * feat - Combine Plots and Tables into a Single Tab #3009 * feat - Resize the plot to make it more readable * feat - Remove the (radar chart) * feat - Add a comment stating that it only shows the Top 5 models in the table. * feat - adjust layout * Update mteb/leaderboard/app.py * format --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Updating the default batch size calculation in the voyage models (#3091) * 1.38.50 Automatically generated by python-semantic-release * fix: Add @classmethod for @field_validators in TaskMetadata (#3100) * Align task prompt dict with `PromptType` (#3101) * align task prompt dict with `PromptType` * use value instead of enum * 1.38.51 Automatically generated by python-semantic-release * model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090) * Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 * Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth) * Format with ruff + add loader per review * Apply ruff format/fixes * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader) * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix import * Add memory_usage_mb=808.0 and required fields to ModelMeta * Fix 210 milions of parameters --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: Allow closed datasets (#3059) * - Added an include_private parameter to the get_tasks() function that defaults to False - This ensures that by default, tests only run on public datasets - Tests can explicitly set include_private=True when needed to test private datasets - Added is_public: bool | None = None field to TaskMetadata - The field is optional and defaults to None (treated as public) - Updated the is_filled() method to exclude is_public from required fields - Added documentation * - Added an include_private parameter to the get_tasks() function that defaults to False - This ensures that by default, tests only run on public datasets - Tests can explicitly set include_private=True when needed to test private datasets - Added is_public: bool | None = None field to TaskMetadata - The field is optional and defaults to None (treated as public) - Updated the is_filled() method to exclude is_public from required fields - Added documentation * Correcting due to comments * Update mteb/abstasks/TaskMetadata.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/overview.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Removing the not used filter_tasks_by_privacy function * Correcting due to comments * Correcting due to comments * Correcting due to comments * Removing the test case * Rename the include_private parameter to exclude_private * Rename the include_private parameter to exclude_private * Add private tasks tests * Add private tasks tests * Update tests/test_tasks/test_private_tasks.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Add private tasks tests * Add private tasks tests * Add private tasks tests --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * 1.38.52 Automatically generated by python-semantic-release * Ci: test out GH models with welcoming new comers (#3112) test out GH models with welcoming new comers * ci: Dataset check on new PR (#3103) * add dataset check on new PR * add extract datasets * run as module * update startswith * update workflow name * add GitPython * export var * same shell session * address review comments * add to docs to say what this script does * add docs * model: add Youtu-Embedding-V1 (#3115) * add youtu models * add a blank line * fix the optional dependencies and lint the code * remove unused dependencies and reformat * revise prompt_type --------- Co-authored-by: springxchen <springxchen@tencent.com> * fix: add voyage quantization models (#3092) * Adding quantization support * Update mteb/models/voyage_models.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/model_meta.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/model_meta.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Simplifying the quantization/output_dtype * Update mteb/model_meta.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * 1.38.53 Automatically generated by python-semantic-release * model: EmbeddingGemma 300M (#3129) * model: EmbeddingGemma 300M * Add license and revision * fix: Add dedicated display for RTEB benchmark results (#3089) * feat - remove special filtering, keep zero-shot, keep borda rank * feat - remove get_rteb_benchmark.py * feat - delete get_rteb_benchmark.py;RTEB_BENCHMARK_ENTRIES changes * feat -format * Update mteb/load_results/benchmark_results.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update tasks & benchmarks tables * 1.38.54 Automatically generated by python-semantic-release * dataset: Add Dapfam patent retrieval tasks (#2946) * chore: add 'Patent retrieval' subtype to TaskMetadata * feat(retrieval): add DAPFAM patent retrieval tasks (+18 variants) * Dapfam patent retrieval PR #2946 : refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...) * Dapfam patent retrieval PR #2946 : refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...) * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Changes : - Added possibility to opt in or out of quantization through the "quantize" argument. - Added possibility to compute raw dot product without normalization. (to reproduce the paper method the "similarity" argument should be "cosine"). - Removed unecessary function and overhauled the tasks descriptions to be more clear. * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Changes made : - Overhauled task descriptions as well as naming to conform with the naming scheme of mteb retrieval tasks. - Similarity is now computed using the similarity function of the passed model. - Changed optional quantization method to conform with sentence transformers similarity function. to reproduce the paper metrics, one can use the following snippet : ```python from mteb import mteb from sentence_transformers import SentenceTransformer model_name = "Snowflake/snowflake-arctic-embed-m-v2.0" model = SentenceTransformer(model_name, model_kwargs={ "torch_dtype": "float16", }, trust_remote_code=True, ).cuda().eval() tasks = mteb.get_tasks(tasks=[ "DAPFAMInTitlAbsToTitlAbsClmRetrieval", "DAPFAMAllTitlAbsToTitlAbsClmRetrieval", "DAPFAMOutTitlAbsToTitlAbsClmRetrieval", add the other 3 remaining tasks ... ]) evaluation = mteb.MTEB(tasks=tasks) results = evaluation.run( model, output_folder=f"mteb_res/{model_name}", quantize=True, # if set to false or not set, the obtained ndcg@10 and map@10 will be ~0.001 higher encode_kwargs= {"batch_size" : 32} ) ``` * changed default value of quantization to false * added the import to all DAPFAM tasks; tested that the works; verified compliance with the checklist * Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * added revision numbers to all dataset loading operations as well as the metadata itself * intermediate changes, refresh local branch * intermediate changes, refresh local branch again * scale back to standard evaluation with empty set exclusion, various cosmetic/formatting changes * minor cosmetic/formatting changes * fixed main metric to be ndcg_at_100 as in the paper * removed old code artifacts from previous versions * read appropriate loading arguments from task metadata, remove unecessary class attribute * reformat bibtex ( remark on the assertion since it tries to match literal string instead of bibtex formatting, format inconsistent with arXiv default), fixed metadata, parameters read from task metadata, all tests passed * refactor data loading to read from metadata class attributes --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update tasks & benchmarks tables * Align max tokens (#3172) * Correct the VoyageAI model's batch creation/batch size calculation (#3185) Correct the batch creation * dataset: Adding JapaneseCode1Retrieval as the first non-public dataset (#3168) * Adding JapaneseCode1Retrieval as the first non-public dataset * Transformed dataset * Adding as private dataset to tests * Correct the private task test * Use the sample dataset as a reference * Use the sample dataset as a reference * fix ds loading * allow on forks * upd aciton * remove paths * try to trigger ci * add ref * add permissions * remove paths * add paths back * get back to pull request * rollback action * Trying to resolve the token/secret problem * Trying to resolve the token/secret problem * Update dataset_loading_pr.yml * Update dataset_loading_pr.yml * Try the latest datasets package (worked for me) * Try the latest datasets package (worked for me) * Try the latest datasets package (worked for me) * (last?) try * (last?) try * (last?) try * Reverting the changes * Exclude the private datasets from tests * Apply suggestions from code review --------- Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Solomatin Roman <samoed.roman@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: add version check for `embeddinggemma-300m` (#3189) add version check * dataset: Added a set of closed datasets (#3186) * Add 12 more closed datasets Extend the RTEB benchmarks * trust_remote_code * trust_remote_code * Enabling JapaneseCode1Retrieval in the RTEB benchmarks * Add closed datasets as private tasks * Correct due to the comment * Update tasks & benchmarks tables * fix: Edit ack & sponsors (#3187) * dataset: Update FaMTEB to Version 2 (#3157) * Update benchmark to version 2 * make others in benchmark selector one line code * small changes * update a few tasks metadata * update faintent license with correct form * remove redundant trust remote codes * fix hardnegatives revision * make lint * fix errors * apply suggestions * fix citation problem * add PR link to benchmark desc * remove duplicate dataset names in mcinext_models * update prompts --------- Co-authored-by: mehran <mehan.sarmadi16@gmail.com> * Update tasks & benchmarks tables * 1.38.55 Automatically generated by python-semantic-release * fix: Add conflicting dependencies to toml (#3191) fix conflict dependencies * 1.38.56 Automatically generated by python-semantic-release * fix: Correct metadata for ArguAna dataset (#3202) * Update tasks & benchmarks tables * 1.38.57 Automatically generated by python-semantic-release * model: Add BMRetriever (#3195) * model: Add BMRetriever * Update mteb/models/bmretriever_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/bmretriever_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: remove trust_remote_code option * feat: implement BMREtrieverWrapper based on InstructSentenceTransformerWrapper * refactor: update training datasets for bmretriever --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Revert "Ci: test out GH models with welcoming new comers" (#3206) Revert "Ci: test out GH models with welcoming new comers (#3112)" This reverts commit 73a35e0bb02e61108d50385f4c43fd7d1b16e984. * model: Add Codefuse models (#3205) * add codefuse models * add codefuse models * Update codefuse_models.py * lint codefuse.py * fix(models): ensure prompt_type is passed to format_instruction (#3216) * 1.38.58 Automatically generated by python-semantic-release * Adding Cohere's output_dimension and embedding_type parameter (#3204) * Adding Cohere's output_dimension and embedding_type parameter Cohere's embed-v4 binary and int8 * Correcting due to comments * dataset: add swedish cpc patent classifications to mteb (#3072) * feat: add swedish cpc patent classifications to mteb * fix: formatting and init imports * fix: update mteb task according to feedback * fix: perform citation and code formatting * fix: add train and test split for both datasets * fix: AttributeError in ColPaliEngineWrapper similarity method (#3177) * fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior * chore: fix colpali_models similarity handle device * Update tasks & benchmarks tables * 1.38.59 Automatically generated by python-semantic-release * fix: prevent EOS token truncation (#3218) * fix(models): prevent EOS token truncation for BMRetriever * refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper` * fix(models): correct eos token handling in `BMRetrieverWrapper` * 1.38.60 Automatically generated by python-semantic-release * Update giga embeddings (#3210) * update giga embeddings * update giga embeddings * 3b-september-2025 * fixed * lint * Update mteb/models/ru_sentence_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * change revision due to flash-attn dependency * change apply_instruction_to_passages --------- Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru> * fix: Refactor split create_tables into static Benchmark methods (#3126) * feat - Split create_tables into static Benchmark methods * feat - format * Update mteb/leaderboard/table.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * feat - remove search query;take benchmark result as input;addressing the circular import, * feat - format * Update mteb/benchmarks/benchmark.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/benchmarks/benchmark.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * feat - use to_dataframe;clean table.py;move creat_table * feat - fix circular import * feat - clean-up * feat - format --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * 1.38.61 Automatically generated by python-semantic-release * Extending the RTEB benchmark (#3223) Adding another voyageai model * Update tasks & benchmarks tables * model: New qzmodel (#3211) * Update qzhou_models.py * Update qzhou_models.py * reformat script code * Update configuration * According to our new decision, the model name has been changed to "QZhou-Embedding-Zh". * Fix variable naming issues. * model: Update Youtu embedding model (#3227) * add youtu models * add a blank line * fix the optional dependencies and lint the code * remove unused dependencies and reformat * revise prompt_type * update youtu_models --------- Co-authored-by: springxchen <springxchen@tencent.com> * dataset: Add Software Issue Localization Datasets (#3178) * add software issue localization datasets * add software issue localization datasets * update and add multilingual datasets * fix citation format issues * Update mteb/tasks/Reranking/eng/SWEbenchVerifiedReranking.py * fix linting issues --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update tasks & benchmarks tables * feat: Officially include RTEB in the leaderboard (#3222) * feat - adjust Rteb's Benchmark * feat - add blank * fix menu names * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * moving around tasks * fix: Update RTEB summary columns (#3226) * fix(models): ensure prompt_type is passed to format_instruction (#3216) * 1.38.58 Automatically generated by python-semantic-release * Adding Cohere's output_dimension and embedding_type parameter (#3204) * Adding Cohere's output_dimension and embedding_type parameter Cohere's embed-v4 binary and int8 * Correcting due to comments * dataset: add swedish cpc patent classifications to mteb (#3072) * feat: add swedish cpc patent classifications to mteb * fix: formatting and init imports * fix: update mteb task according to feedback * fix: perform citation and code formatting * fix: add train and test split for both datasets * fix: AttributeError in ColPaliEngineWrapper similarity method (#3177) * fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior * chore: fix colpali_models similarity handle device * Update tasks & benchmarks tables * 1.38.59 Automatically generated by python-semantic-release * fix: prevent EOS token truncation (#3218) * fix(models): prevent EOS token truncation for BMRetriever * refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper` * fix(models): correct eos token handling in `BMRetrieverWrapper` * 1.38.60 Automatically generated by python-semantic-release * Update giga embeddings (#3210) * update giga embeddings * update giga embeddings * 3b-september-2025 * fixed * lint * Update mteb/models/ru_sentence_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * change revision due to flash-attn dependency * change apply_instruction_to_passages --------- Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru> * fix: Refactor split create_tables into static Benchmark methods (#3126) * feat - Split create_tables into static Benchmark methods * feat - format * Update mteb/leaderboard/table.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * feat - remove search query;take benchmark result as input;addressing the circular import, * feat - format * Update mteb/benchmarks/benchmark.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/benchmarks/benchmark.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * feat - use to_dataframe;clean table.py;move creat_table * feat - fix circular import * feat - clean-up * feat - format --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * 1.38.61 Automatically generated by python-semantic-release * Extending the RTEB benchmark (#3223) Adding another voyageai model * Update tasks & benchmarks tables * feat - filter_by_privacy * feat - add new fields for rteb part * feat - getattr * feat - adjust privacy filter logic * feat - enhance summary table column renaming and add 'is_public' field mapping * fix: remove unused 'is_public' attribute from TaskResult --------- Co-authored-by: Yongbin Choi <whybe.choi@gmail.com> Co-authored-by: semantic-release <semantic-release> Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com> Co-authored-by: Atheer <atheer2104@protonmail.com> Co-authored-by: Yong woo Song <ywsong.dev@kakao.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com> Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: smile <smile@pinai.io> Co-authored-by: ethan <smiletoye@gmail.com> * removed show_rteb args * avoid defining function where we can just use the metadata * minor fixes * minor fixes * fix: Correct logic for filtering public tasks in ModelResult class (#3230) Co-authored-by: ethan <smiletoye@gmail.com> --------- Co-authored-by: q275343119 <275343119@qq.com> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com> Co-authored-by: Yongbin Choi <whybe.choi@gmail.com> Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com> Co-authored-by: Atheer <atheer2104@protonmail.com> Co-authored-by: Yong woo Song <ywsong.dev@kakao.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com> Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru> Co-authored-by: smile <smile@pinai.io> Co-authored-by: ethan <smiletoye@gmail.com> * Update tasks & benchmarks tables * 1.39.0 Automatically generated by python-semantic-release * fix: Add submission references for RTEB (#3233) * fix: Add rteb submission references and improve descriptions. * Added evaluation request * added field for tasks * 1.39.1 Automatically generated by python-semantic-release * dataset: add human tasks and benchmark (#3214) * Human Subsets Tasks * Fixed Multilingual Classification Subset * linting * fix citations format * make lint * fix tests * remove human folder * fix relative imports * add adapted_from for all human subsets * fix pydantic errors * add benchmark object * make benchmark discoverable * bibtex test * Apply suggestion Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> * Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> * rename & reupload * upd tests * upd tests again * add model * add benchmark to leaderboard * change branch of leaderboard * remove branch of load data * fix model meta path * make mteb importable * update repo * Update mteb/benchmarks/benchmarks/benchmarks.py * Update mteb/leaderboard/benchmark_selector.py * Update mteb/load_results/load_results.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> --------- Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com> * Update tasks & benchmarks tables * Remove 'HUME(v1)' from leaderboard benchmark (#3236) * Remove 'HUME(v1)' from leaderboard benchmark * lint * docs: Update adding benchmark documentation (#3229) * update adding_a_benchmark.md documentation * fix numbers * fix: Further specified macro-language code for Norwegian (#3228) * fix: Further specified macro-language code for Norwegian "nor" is a macro-language code that covers bokmål and nynorsk (both norwegian), but this means that these datasets will be missed if using "nob" or "nno". Specifying it like this should allow this. * furhter specified macro language "nor" * Update tasks & benchmarks tables * 1.39.2 Automatically generated by python-semantic-release * fix max tokens (#3243) * fix python39 transformers compatibility (#3254) * fix python39 transformers * fix * Aggregate by subset for HUMEv1 (#3255) aggregate by subset for HUMEv1 * Update tasks & benchmarks tables * Fix AbsTaskTextRegression task (#3257) Fix AbsTaskTextRegression * Added Japanese to Retrieval (#3252) * feat - add Japanese * feat - use mteb.get_benchmark * fix - 3.9 test error * Revert "fix - 3.9 test error" This reverts commit 6bfee53cff48304cc22d8248aa275dcc9e385475. * fix - 3.9 test error * Update tasks & benchmarks tables * fix bm25 on small datasets (#3261) * fix: Move zero-shot percentage calculation to the end of summary (#3231) * Refactor: Move zero-shot percentage calculation to the end of summary table creation which only apply to RTEB table. * Update RTEB benchmark name from "RTEB(beta)" to "RTEB" for consistency in display. * feat - RTEB(beta) * feat - remove Zero-shot --------- Co-authored-by: ethan <smiletoye@gmail.com> * model: Add ReasonIR (#3221) * model: Add ReasonIR * Update mteb/models/reasonir_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/reasonir_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * update n_parameters of ReasonIR Co-authored-by: Niklas <n.muennighoff@gmail.com> --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Niklas <n.muennighoff@gmail.com> * fix: Only pin model name and rank (#3263) Currently we pin 3 columns, this makes it hard or impossible to view on phones. The 3rd column is also no longer garuanteed as RTEB leaderboard does not use the zero-shot column * 1.39.3 Automatically generated by python-semantic-release * fix: resolve flash-attention dependency issue (#3265) * fix: Only pin model name and rank Currently we pin 3 columns, this makes it hard or impossible to view on phones. The 3rd column is also no longer garuanteed as RTEB leaderboard does not use the zero-shot column * fix: resolve flash-attention dependency issue This has been tested and works. fixed Resolve flash-attention dependency issues Fixes #3240 * 1.39.4 Automatically generated by python-semantic-release * fix: Add retry and token counting in Cohere models (#3253) * Retry and token counting in Cohere models * Retry and token counting in Cohere models * Retry and token counting in Cohere models --------- Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> * 1.39.5 Automatically generated by python-semantic-release * Align MIEB leaderboards with paper (#3272) * sort by mean task type and use pure rank for MIEB LBs * lint * rename task type column for readability * fix: add prompt for MIRACLRetrievalHardNegatives (#3266) * add prompt for MIRACLRetrievalHardNegatives * add `MIRACLRetrievalHardNegatives.v2` * Update mteb/tasks/Retrieval/multilingual/MIRACLRetrieval.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * move common metadata to dict --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tasks & benchmarks tables * Add Regression task mock (#3271) * 1.39.6 Automatically generated by python-semantic-release * fix: Change language for task SlovakMovieReviewSentimentClassification (#3296) * Update tasks & benchmarks tables * 1.39.7 Automatically generated by python-semantic-release * Add english code retriever model (#3302) * Add en code retriever model * fix model_name * Update mteb/models/en_code_retriever.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * correct lint --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * docs: fix typos in `docs/adding_a_benchmark.md` (#3344) * BREAKING: v2.0.0 (#1433) * [v2] Merge…

[ADD] 50 vietnamese dataset from vn-mteb

4bd3b39

Samoed requested changes Aug 1, 2025

View reviewed changes

[UPDATE] task metadata

e87fb0d

Samoed reviewed Aug 2, 2025

View reviewed changes

[UPDATE] import dependencies

5d18eca

KennethEnevoldsen reviewed Aug 2, 2025

View reviewed changes

BaoLocPham and others added 2 commits August 4, 2025 16:19

Merge branch 'embeddings-benchmark:main' into main

e3237aa

[UPDATE] task metadata, bibtext citation

0f7a192

[UPDATE-TEST] test_model_meta

35948bc

Samoed approved these changes Aug 4, 2025

View reviewed changes

BaoLocPham added 2 commits August 4, 2025 15:59

[UPDATE] sample_creation to machine-translated and LM verified

c136fb6

[ADD] sample creation machine-translated and LM verified

b26a507

Samoed changed the title ~~[ADD] 50 Vietnamese dataset from vn-mteb~~ dataset: [ADD] 50 Vietnamese dataset from vn-mteb Aug 5, 2025

This was referenced Aug 7, 2025

model: Add GreenNode Vietnamese Embedding models #2994

Merged

benchmark [ADD] VN-MTEB benchmark and Leaderboard #2995

Merged

KennethEnevoldsen requested changes Aug 7, 2025

View reviewed changes

[REMOVE] default fields metadata in Classfication tasks

9d2a03b

KennethEnevoldsen approved these changes Aug 9, 2025

View reviewed changes

KennethEnevoldsen merged commit 741b022 into embeddings-benchmark:main Aug 9, 2025
9 checks passed


		from mteb.abstasks.TaskMetadata import TaskMetadata

		from ....abstasks import AbsTaskClassification, MultilingualTask

	from ....abstasks import AbsTaskClassification, MultilingualTask
	from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
	from mteb.abstasks.TaskMetadata import TaskMetadata

		class AmazonCounterfactualVNClassification(AbsTaskClassification):
		metadata = TaskMetadata(

Conversation

BaoLocPham commented Aug 1, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KennethEnevoldsen left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BaoLocPham commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Samoed commented Aug 4, 2025

Uh oh!

BaoLocPham commented Aug 4, 2025

Uh oh!

KennethEnevoldsen left a comment

Choose a reason for hiding this comment

Uh oh!

BaoLocPham commented Aug 8, 2025

Uh oh!

Uh oh!

KennethEnevoldsen commented Aug 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

KennethEnevoldsen left a comment •

edited

Loading

BaoLocPham commented Aug 4, 2025 •

edited

Loading